Abstract: This paper introduces a novel method to score how well
proposed fused image quality measures (FIQMs) indicate the
effectiveness of humans in detecting targets of interest in fused
imagery. Human detection performance is measured via human
perception experiments. A good FIQM should relate to perception
results in a monotonic fashion. The new method, the diffuse prior
monotonic likelihood ratio (DPMLR) test, compares the $H_1$
hypothesis that the intrinsic human detection performance is related
to the FIQM via a monotonic function to the null hypothesis that the
detection and image quality relationship is random. The paper
discusses several properties of the DPMLR and demonstrates the
effectiveness of the DPMLR test via Monte Carlo simulations. Finally,
the DPMLR is used to score FIQMs over 35 scenes implementing various
image fusion algorithms.

Keywords: Image fusion, fused image quality measures, hypothesis
test, monotonic correlation.

1  Introduction 

In recent years, image fusion has attracted a large amount of
attention in a wide variety of applications such as concealed weapon
detection [1], remote sensing [2], intelligent robots [3], medical
diagnosis [4], and military surveillance [5]. Image fusion refers to
generating a fused image in which each pixel is determined from a set
of pixels in each source image. The fused image should contain a
better view of the scene than any of the source images, thus
improving computer or human interpretation. The interested reader is
referred to Chapter 1 of [6] for a survey of various image fusion
algorithms developed in past years.

Measuring the performance of image fusion algorithms is an extremely
important task which has received past study [7-21]. The performance
of image fusion algorithms is primarily assessed by perceptual
evaluation in the form of subjective human tests [13]. In these
tests, human observers are asked to view a series of fused images and
rate them. Although the subjective tests are typically accurate if
performed correctly, they are inconvenient, expensive and time
consuming. Hence, we desire an objective performance measure that can
accurately predict human perception. Note that here we refer to the
metrics and features proposed for evaluating the quality of the fused
images as fused image quality measures (FIQMs). In the literature,
there are three broad classes of FIQMs. The first class requires a
reference fused image (or the ground truth image), while the others
do not. In some special cases (for instance, multi-focus image fusion
[8]), it is possible to generate such a reference image. Once the
ground truth image is given, we can use existing quality metrics such
as the mean square error and the peak signal to noise ratio to
compare the experimental fused results with the reference. However,
in many applications generating the ideal fused image is very
difficult. For this reason we do not consider FIQMs which require a
reference image in this paper. A second class of FIQMs, introduced
recently, has received a lot of attention [9-12]. These measures, see
[14], consider the sum of correlations between each source image and
the fused image, which provides a measurement of the amount of
information transferred from the source images to the fused image.
The third class of FIQMs tries to extract the salient features, such
as the structure, texture, contrast and edge information, directly
from the fused image without regard to the source images [17-21].
Comparisons between existing FIQMs have been lacking.

Quantitatively evaluating image fusion performance is a complicated
issue because of the lack of a complete understanding of the human
visual system (HVS), and because of the variety of image fusion
applications [15]. We expect that the FIQM should be task specific,
and that the best measure changes from task to task. Given an image
fusion application and many kinds of proposed FIQMs, we are
interested in which quality measure describes the image fusion
performance better. Clearly, a good FIQM must be related in a
monotonic fashion to how a human would judge the quality of the fused
image. Therefore, a statistic is needed that quantifies how
consistent different FIQMs are with actual human performance.
Developing such a statistic is the focus of this paper.

In [16], Pearson (or linear) correlation and root mean squared error
(RMSE) are used to score potential FIQMs. The Pearson correlation is
the most common method to determine whether or not the input and
output sequences are related. It quantifies how well a straight line
fits the mapping between the input and output sequences.
Unfortunately, when the relationship between the quality measure and
the human performance is nonlinear, the value of the Pearson
correlation can be small despite the fact that the sequences are
still monotonically related. In essence, a proper statistic needs to
determine whether the ordering of the quality measures preserves the
ordering of the corresponding human performance measures. A nonlinear
correlation coefficient referred to as the monotonic correlation (MC)
has been proposed in [17]. The MC is more general than the Pearson
correlation and exploits the monotonic regression between the quality
measures and the human observations. However, it assumes that the
perception error is Gaussian, which should be fine for a large number
of observers. In this paper, we take a different approach. We focus
on cases where the fused image is to be used for object detection.
Performance is measured by the probability that a human observer can
correctly detect certain objects of interest in the fused image. We
introduce a new monotonic statistic for the object detection task
where the underlying perception results should follow a binomial
distribution and the number of observers is small.
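The limitation of the Pearson correlation noted above can be illustrated with a small sketch. The detection probabilities and the power-law mapping below are hypothetical values chosen only for illustration; they are not data from the perception study.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def same_ordering(xs, ys):
    """True if sorting the pairs by xs also leaves ys in non-decreasing order."""
    pairs = sorted(zip(xs, ys))
    return all(a[1] <= b[1] for a, b in zip(pairs, pairs[1:]))

# Hypothetical detection probabilities and two FIQMs that are both
# monotonic in p: one linear, one strongly nonlinear (x = p^6).
p = [i / 10 for i in range(1, 11)]
x_lin = list(p)
x_pow = [v ** 6 for v in p]

r_lin = pearson(x_lin, p)   # essentially 1
r_pow = pearson(x_pow, p)   # noticeably below 1, yet the ordering is preserved
print(r_lin, r_pow, same_ordering(x_pow, p))
```

Both measures order the images identically, but only the linear one attains a Pearson correlation near one, which is why an ordering-based statistic is preferable here.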

The paper is organized as follows. Section 2 presents the perception
model and introduces the new monotonic statistic. Section 3
demonstrates the effectiveness of the new statistic via Monte Carlo
simulations. The statistic is used to score potential FIQMs against
actual perception results for fused image interpretation in
Section 4. Finally, Section 5 provides some concluding remarks.

2  Monotonic  Statistic 

2.1  Data  Models 

This paper considers the detection task, so the performance of an
image fusion algorithm is the probability that a human observer can
correctly detect certain objects of interest in the fused image. A
scene is a realization of $F$ source images, and $N$ fused images are
generated from these $F$ images via $N$ different algorithms. The
existence (or lack) of a monotonic relationship between measured
human performance and computed FIQMs can be inferred over $S$ scenes.
To this end, this subsection provides the data models that enable
this inference.

For a given scene, let the $N \times 1$ vector $p$ denote the actual
performance for all fusion methods, where $p_i$ is the object
detection probability associated with the $i$-th fused image. The
value of $p$ is unobservable. It can only be inferred via perception
experiments that measure $y$, where $y_i$ is the number of observers
that correctly detect the targets in the $i$-th fused image. We use
$o_i$ to represent the number of observers that participate in the
detection experiment for the $i$-th fused image. It is reasonable to
model $y$ as a random vector whose elements are statistically
independent, where $y_i$ is drawn from a binomial distribution with
parameters $o_i$ and $p_i$, i.e.,

\[ y \sim f(y \mid o, p) = \prod_{i=1}^{N} \binom{o_i}{y_i} p_i^{y_i} (1 - p_i)^{o_i - y_i}. \tag{1} \]

Here we collect $(o_1, \ldots, o_N)$ in an $N \times 1$ vector $o$
for convenience.
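As a minimal sketch of the data model in (1), the joint binomial likelihood can be evaluated directly. The function name and the example values of y, o and p are illustrative assumptions, not quantities from the paper.

```python
from math import comb

def binomial_likelihood(y, o, p):
    """Evaluate f(y | o, p) of (1): a product of independent binomial terms.

    y[i] is the number of correct detections, o[i] the number of observers,
    and p[i] the (normally unobservable) detection probability for image i.
    """
    like = 1.0
    for yi, oi, pi in zip(y, o, p):
        like *= comb(oi, yi) * pi ** yi * (1.0 - pi) ** (oi - yi)
    return like

# Hypothetical example: three fused images, five observers each.
example = binomial_likelihood([1, 3, 5], [5, 5, 5], [0.2, 0.6, 0.9])
print(example)
```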

Let a given FIQM evaluated over the $N$ images be denoted as $x$. The
measure value $x_i$ is a deterministic function of the $i$-th fused
image and the $F$ source images. However, over the ensemble of all
possible scenes, the value of $x_i$ can be viewed as a statistical
quantity. The proposed monotonic hypothesis test evaluates how well a
FIQM monotonically relates to human object detection performance.
Under the monotonic hypothesis, there is a monotonic function that
maps the measure value $x_i$ associated with the $i$-th fusion method
to the detection probability $p_i$, i.e.,

\[ p_i = g(x_i), \tag{2} \]

where $g(x)$ is a monotonic increasing or decreasing function of $x$.
For notational convenience, we index the $N$ image fusion algorithms
in ascending order of the corresponding measure values, i.e.,
$x_1 \le x_2 \le \ldots \le x_N$. Thus, we consider two alternative
$H_1$ hypotheses: $H_\uparrow$ for ascending $p_i$'s and
$H_\downarrow$ for descending $p_i$'s. On the other hand, the null
hypothesis is that over the ensemble of possible fused imagery, the
$x_i$'s are i.i.d. samples. Thus, the $p_i$'s are in random order,
where every permutation of the order is equally probable.

A given scene is a realization from the ensemble of possible source
images. Therefore, we can model the detection probabilities as being
drawn from a random distribution. We use an uninformative (or
diffuse) prior for the hypotheses. For the $H_\uparrow$,
$H_\downarrow$, and $H_0$ hypotheses, $p$ is uniformly distributed
over

\[
\begin{aligned}
\mathcal{P}_\uparrow &= \{ p : 0 \le p_1 \le \ldots \le p_N \le 1 \}, \\
\mathcal{P}_\downarrow &= \{ p : 1 \ge p_1 \ge \ldots \ge p_N \ge 0 \}, \text{ and} \\
\mathcal{P}_0 &= \{ p : 0 \le p_1, \ldots, p_N \le 1 \},
\end{aligned}
\tag{3}
\]

respectively. Over all hypotheses, we model the vectors $p_s$ for the
different scenes $s$ as statistically independent of each other.

2.2  Diffuse  Prior  Monotonic  Likelihood  Ratio  Test

The proposed monotonic statistic leads to a hypothesis test that is
designed to work for a small number of observers. It exploits the
binomial distribution of the perception results by considering the
likelihoods for each of the hypotheses. The ascending and descending
likelihoods are given by (1). Because the ordering of the elements of
$y$ and $o$ is random under the null hypothesis, the likelihood of
$p$ does not depend on the ordering of the observations. In short,
the hypothesis test distinguishes between the following three
likelihoods:

\[
\begin{aligned}
l(H_\uparrow \mid y, o, p) &= f(y \mid o, p) && \text{for } p \in \mathcal{P}_\uparrow, \\
l(H_\downarrow \mid y, o, p) &= f(y \mid o, p) && \text{for } p \in \mathcal{P}_\downarrow, \\
l(H_0 \mid y, o, p) &= \frac{1}{N!} \sum_{j=1}^{N!} f(P_j y \mid P_j o, p) && \text{for } p \in \mathcal{P}_0,
\end{aligned}
\tag{4}
\]

where $P_j$ is one of the $N!$ possible $N \times N$ permutation
matrices. The resulting hypothesis tests are not simple tests because
$p$ is not observable. One could resort to a generalized likelihood
ratio, but the resulting test is not a uniformly most powerful test.
Alternatively, we use the uninformative model for $p$ as given in
Section 2.1 and calculate the expected likelihood via

\[ l(H_i \mid y, o) = \int_{\mathcal{P}_i} l(H_i \mid y, o, p) \, f(p \mid H_i) \, dp, \tag{5} \]

where $i \in \{\uparrow, \downarrow, 0\}$ and $f(p \mid H_i)$ is the
uniform probability density function over $\mathcal{P}_i$,

\[
f(p \mid H_i) =
\begin{cases}
\left( \int_{\mathcal{P}_i} dp \right)^{-1} & p \in \mathcal{P}_i, \\
0 & \text{otherwise.}
\end{cases}
\tag{6}
\]

It is easy to see that

\[ \int_{\mathcal{P}_0} dp = 1, \quad \text{and} \quad \int_{\mathcal{P}_\uparrow} dp = \int_{\mathcal{P}_\downarrow} dp = \frac{1}{N!}. \tag{7} \]
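The volumes in (7) can be checked numerically. The sketch below is only an illustration: it estimates the fraction of uniform draws over the unit cube that fall in the ascending region, and it shows one standard way to draw from the diffuse prior over that region, namely sorting i.i.d. uniform samples. The function names, seed and trial count are assumptions made for this example.

```python
import math
import random

def sample_ascending(n, rng):
    """Draw p uniformly over {0 <= p_1 <= ... <= p_n <= 1}.

    The order statistics of n i.i.d. uniform samples are uniformly
    distributed over the ordered region, so sorting is sufficient.
    """
    return sorted(rng.random() for _ in range(n))

def ascending_fraction(n, trials, rng):
    """Monte Carlo estimate of the volume of the ascending region."""
    hits = 0
    for _ in range(trials):
        p = [rng.random() for _ in range(n)]
        if all(a <= b for a, b in zip(p, p[1:])):
            hits += 1
    return hits / trials

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_images = 3
frac = ascending_fraction(n_images, 60000, rng)
print(frac, 1 / math.factorial(n_images))   # the two values should be close
```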


The diffuse ascending and descending likelihood ratios to test the
$H_\uparrow$ and $H_\downarrow$ hypotheses, respectively, are given
by:

\[ \Lambda_N^\uparrow(y, o) = \frac{l(H_\uparrow \mid y, o)}{l(H_0 \mid y, o)}, \tag{8} \]

\[ \Lambda_N^\downarrow(y, o) = \frac{l(H_\downarrow \mid y, o)}{l(H_0 \mid y, o)}. \tag{9} \]

For multiple scenes, the overall likelihood ratios are the product of
the single-scene likelihood ratios, because the $p_s$'s are modeled
as independent across scenes in Section 2.1. The likelihood ratio for
the monotonic relationship is

\[ \Lambda_N = \max \left\{ \prod_{s=1}^{S} \Lambda_N^\uparrow(y_s, o_s), \; \prod_{s=1}^{S} \Lambda_N^\downarrow(y_s, o_s) \right\}, \tag{10} \]

where $y_s$ and $o_s$ are the number of correct detections and
observations for the $s$-th scene, respectively. Unless it is
required, the scene index is implicit for the sake of notational
brevity. We refer to $\Lambda_N$ as the diffuse prior monotonic
likelihood ratio (DPMLR). When $\Lambda_N > 1$, the evidence in
support of the monotonic hypothesis is greater than that of the null
hypothesis, under which the FIQM behaves as noise with respect to
human performance. As $\Lambda_N$ increases, so does the evidence
that the FIQM under test is actually a good measure. The DPMLR test
simply accepts the monotonic hypothesis if the DPMLR exceeds a given
threshold value. Usually, the threshold is greater than one.

2.3  Recursive  Computation 

To our knowledge, a closed form expression for (8) and (9) does not
exist. It is possible to calculate the diffuse likelihood ratios
numerically. However, due to the multiple integration involved in the
expression, the calculation incurs a large computational cost,
especially when $N$ and the $o_i$'s are large. This subsection
provides a recursion to calculate these diffuse likelihood ratios.

The diffuse likelihood for $H_0$ can be simply expressed as:

\[ l(H_0 \mid y, o) = \prod_{i=1}^{N} \binom{o_i}{y_i} \beta(y_i + 1, o_i - y_i + 1), \tag{11} \]

where

\[ \beta(a, b) = \int_0^1 z^{a-1} (1 - z)^{b-1} \, dz \tag{12} \]

is the Beta function.
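For integer arguments the Beta function in (12) reduces to a ratio of factorials, $\beta(a, b) = (a-1)!\,(b-1)!/(a+b-1)!$, so each factor of (11) collapses to $1/(o_i + 1)$. The following sketch verifies this with exact rational arithmetic; the helper names are illustrative and not part of the paper.

```python
from fractions import Fraction
from math import comb, factorial

def beta_int(a, b):
    """Beta function for positive integers: (a-1)! (b-1)! / (a+b-1)!."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def null_diffuse_likelihood(y, o):
    """Evaluate (11): prod_i C(o_i, y_i) * beta(y_i + 1, o_i - y_i + 1)."""
    like = Fraction(1)
    for yi, oi in zip(y, o):
        like *= comb(oi, yi) * beta_int(yi + 1, oi - yi + 1)
    return like

# Hypothetical counts: three images with five observers each.
value = null_diffuse_likelihood([1, 3, 5], [5, 5, 5])
print(value)   # equals 1 / (5 + 1)^3 regardless of the counts y
```

Under the diffuse prior, the null likelihood therefore depends only on the numbers of observers and not on the observed counts.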

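Substituting equations (1), (5) and (11) into (8), the ascending
diffuse likelihood ratio can be expressed as:

\[ \Lambda_N^\uparrow(y, o) = \frac{N! \int_{\mathcal{P}_\uparrow} h(p_N; y_N, o_N) \cdots h(p_1; y_1, o_1) \, dp}{\prod_{i=1}^{N} \beta(y_i + 1, o_i - y_i + 1)}, \tag{13} \]

where

\[ h(p; y, o) = p^y (1 - p)^{o - y}. \tag{14} \]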
By considering the power series expansion of the regularized
incomplete Beta function, (13) can be simplified. Specifically, the
regularized incomplete Beta function is defined as

\[ I(y; a, b) = \frac{\int_0^y z^{a-1} (1 - z)^{b-1} \, dz}{\beta(a, b)}, \tag{15} \]

and the power series expansion for $I(y; a, b)$ is

\[ I(y; a, b) = \frac{1}{a + b} \sum_{j=a}^{a+b-1} \frac{1}{\beta(j + 1, a + b - j)} \, y^j (1 - y)^{a+b-1-j}. \tag{16} \]

Now (13) can be simplified to:

\[
\begin{aligned}
\Lambda_N^\uparrow(y, o)
&= \frac{N!}{o_1 + 2} \sum_{j=y_1+1}^{o_1+1}
   \frac{\beta(j + y_2 + 1, o_1 + o_2 + 2 - y_2 - j)}{\beta(j + 1, o_1 + 2 - j) \, \beta(y_2 + 1, o_2 - y_2 + 1)} \\
&\qquad \times
   \frac{\int_0^1 \int_0^{p_N} \cdots \int_0^{p_3} h(p_N; y_N, o_N) \cdots h(p_2; j + y_2, o_1 + o_2 + 1) \, dp_2 \cdots dp_N}
        {\beta(j + y_2 + 1, o_1 + o_2 + 2 - y_2 - j) \prod_{i=3}^{N} \beta(y_i + 1, o_i - y_i + 1)} \\
&= \frac{N}{o_1 + 2} \sum_{j=y_1+1}^{o_1+1}
   \frac{\beta(j + y_2 + 1, o_1 + o_2 + 2 - y_2 - j)}{\beta(j + 1, o_1 + 2 - j) \, \beta(y_2 + 1, o_2 - y_2 + 1)} \\
&\qquad \times
   \Lambda_{N-1}^\uparrow\left( [j + y_2, y_3, \ldots, y_N]', [o_1 + o_2 + 1, o_3, \ldots, o_N]' \right),
\end{aligned}
\tag{17}
\]

where the factor $N$ in the second equality arises because the
remaining integral ratio equals $\Lambda_{N-1}^\uparrow / (N-1)!$.

Also note that by definition,

\[ \Lambda_1^\uparrow(y_1, o_1) = 1. \tag{18} \]

Thus the ascending diffuse likelihood ratio can be computed
numerically via the recursion defined in (17) and (18). A similar
recursion can compute the descending diffuse likelihood ratio.
Alternatively, one can use (17) and (18) and exploit the fact that

\[ \Lambda_N^\downarrow\left( [y_1, \ldots, y_N]', [o_1, \ldots, o_N]' \right) = \Lambda_N^\uparrow\left( [o_1 - y_1, \ldots, o_N - y_N]', [o_1, \ldots, o_N]' \right). \tag{19} \]

The symmetric relationship in (19) can be proved by the simple change
of variables $p \to 1 - p$ in (13).

2.4  Properties

For the common case that the number of observers in the perception
experiment is the same for all fused images, i.e., $o_i = o$ for
$1 \le i \le N$, the diffuse likelihood ratios have some interesting
properties:

1. If $y_1 = y_2 = \ldots = y_N$, then
   $\Lambda_N^\uparrow(y, o) = 1 = \Lambda_N^\downarrow(y, o)$.

2. If the $y_i$'s are in ascending (or descending) order and they are
   not constant, then $\Lambda_N^\uparrow(y, o) > 1$ (or
   $\Lambda_N^\downarrow(y, o) > 1$).

3. The product
   $\Lambda_N^\uparrow(y, o) \cdot \Lambda_N^\downarrow(y, o) \le 1$,
   where equality occurs if and only if
   $\Lambda_N^\uparrow(y, o) = \Lambda_N^\downarrow(y, o) = 1$.

4. $\Lambda_N^\uparrow(y, o) \le N!$ and
   $\Lambda_N^\downarrow(y, o) \le N!$.

The first property states that when all observations are equal, one
cannot distinguish between the ascending, descending, and null
hypotheses. This is due to the fact that all orderings of the
observations are indistinguishable. The second property states that
as long as the human performance $y$ is increasing (or decreasing) in
concert with $x$, the diffuse likelihood ratio will favor the
ascending $H_\uparrow$ (or descending $H_\downarrow$) over the null
hypothesis $H_0$. The third property states that the ascending and
descending hypotheses can never both be favored over the null
hypothesis. The last property states that the upper bound for the
diffuse likelihood ratios is given by the number of order
permutations. The bound is easy to confirm by inspection of (13),
where the integral in the numerator is bounded above by the product
of the Beta functions in the denominator.

Due to space limitations, this paper omits the formal proofs for
these properties. Note that we have yet to prove Property 2.
Calculations via (17) and (18) for various values of $N$ and $o$ have
yet to identify a counterexample.
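The recursion of Section 2.3 can be sketched in a few lines. The code below is an illustrative implementation under two assumptions: the leading factor of the recursive step is taken to be $N$ (that is, $N!/(N-1)!$), and memoization is added to avoid recomputing repeated subproblems. The function names and the input format of `dpmlr` are choices made for this example. Exact rational arithmetic is used so that the properties above can be checked without rounding error.

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial

def beta_int(a, b):
    """Beta function for positive integers: (a-1)! (b-1)! / (a+b-1)!."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

@lru_cache(maxsize=None)
def ratio_up(y, o):
    """Ascending diffuse likelihood ratio for tuples y (detections), o (observers)."""
    n = len(y)
    if n == 1:
        return Fraction(1)                      # base case (18)
    y1, y2, o1, o2 = y[0], y[1], o[0], o[1]
    total = Fraction(0)
    for j in range(y1 + 1, o1 + 2):             # j = y1+1, ..., o1+1
        weight = beta_int(j + y2 + 1, o1 + o2 + 2 - y2 - j) / (
            beta_int(j + 1, o1 + 2 - j) * beta_int(y2 + 1, o2 - y2 + 1))
        # Merge the first two images into one pseudo-image, as in (17).
        total += weight * ratio_up((j + y2,) + y[2:], (o1 + o2 + 1,) + o[2:])
    return Fraction(n, o1 + 2) * total

def ratio_down(y, o):
    """Descending ratio via the symmetry (19): replace y_i by o_i - y_i."""
    return ratio_up(tuple(oi - yi for yi, oi in zip(y, o)), o)

def dpmlr(scenes):
    """DPMLR of (10); scenes is a list of (y, o) tuple pairs, one per scene."""
    up = Fraction(1)
    down = Fraction(1)
    for y, o in scenes:
        up *= ratio_up(y, o)
        down *= ratio_down(y, o)
    return max(up, down)

print(ratio_up((0, 1), (1, 1)), ratio_down((0, 1), (1, 1)))
print(float(ratio_up((0, 2, 4), (4, 4, 4))))
```

For two images with one observer each and counts (0, 1), the ascending ratio can be confirmed by integrating (13) by hand, and the product of the ascending and descending ratios stays below one, in line with Property 3.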

3  DPMLR  Performance  Analysis
In this section, we assess the performance of the proposed DPMLR
test. To this end, we generate Monte Carlo realizations of $y$, $x$,
and $p$. Specifically, the $p_i$'s are generated uniformly over
[0,1]. For the monotonic hypothesis, $x_i = (p_i)^\alpha$. For the
null hypothesis, the $x_i$'s are i.i.d. from a uniform distribution.
For either hypothesis, the $y_i$'s are random realizations of the
binomial distribution (see (1)). For a given hypothesis and values of
$o$, $N$, and $\alpha$, we generated 10000 realizations of $y$, $x$,
and $p$, and we computed the associated DPMLR given one scene, i.e.,
$S = 1$. Then we use the histograms of the DPMLR to generate ROC
curves by varying the acceptance threshold and tabulating the
fraction of acceptances under the monotonic hypothesis, i.e., the
probability of detection ($P_d$), and under the null hypothesis,
i.e., the probability of false alarm ($P_f$). As a means of
comparison, we also compute ROC curves associated with the Pearson
correlation and monotonic correlation [17] in a similar fashion over
the same simulations.
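The simulation procedure described above can be outlined as follows. This is a sketch only: the scoring function used here is a simple pairwise concordance count that stands in for the DPMLR (it is not the statistic proposed in this paper), and the trial count, seed and function names are assumptions made for the example. Any statistic that maps a realization (x, y) to a scalar could be substituted.

```python
import random

def simulate(n_images, n_obs, alpha, monotonic, rng):
    """Draw one realization (x, y) under the monotonic or the null hypothesis."""
    p = [rng.random() for _ in range(n_images)]
    if monotonic:
        x = [pi ** alpha for pi in p]            # x_i = p_i^alpha
    else:
        x = [rng.random() for _ in range(n_images)]
    # y_i ~ Binomial(n_obs, p_i), drawn as a sum of Bernoulli trials.
    y = [sum(rng.random() < pi for _ in range(n_obs)) for pi in p]
    return x, y

def concordance_score(x, y):
    """Stand-in score: |concordant pairs - discordant pairs| (not the DPMLR)."""
    s = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            d = (x[i] - x[j]) * (y[i] - y[j])
            s += (d > 0) - (d < 0)
    return abs(s)

def roc_points(scores_h1, scores_h0):
    """Return (Pf, Pd) pairs obtained by sweeping the acceptance threshold."""
    points = []
    for t in sorted(set(scores_h1) | set(scores_h0)):
        pd = sum(s >= t for s in scores_h1) / len(scores_h1)
        pf = sum(s >= t for s in scores_h0) / len(scores_h0)
        points.append((pf, pd))
    return points

rng = random.Random(1)
trials = 2000   # the paper uses 10000 realizations; fewer here for speed
h1 = [concordance_score(*simulate(10, 5, 2.0, True, rng)) for _ in range(trials)]
h0 = [concordance_score(*simulate(10, 5, 2.0, False, rng)) for _ in range(trials)]
pts = roc_points(h1, h0)
print(pts[:3])
```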

Fig. 1 includes ROC curves of the various tests of correlation
between $x$ and $y$ for the three cases $\alpha = 1$, 2 and 6. For
each case, $N = 10$ and $o = 5$. In these plots, the thick solid,
thin solid, and dotted lines denote the ROC curves for the DPMLR
test, the monotonic correlation test, and the Pearson correlation
test, respectively. In Fig. 1(a) where $\alpha = 1$, the Pearson
correlation performs better than the monotonic diffuse likelihood
ratio. This is explained by the fact that the relationship between
$x$ and $y$ is actually linear, and the Pearson correlation exploits
the actual values of $x$ and not just the ordering. However, as the
$g(x)$ function becomes more nonlinear, the performance of the
Pearson correlation degrades. The performance of the DPMLR test is
robust to the nonlinearity, and this test always outperforms the
monotonic correlation.

4  Perception  Results 

Long-wave infrared (LWIR) and image intensified (II) imagery was
collected in a simulated military operation in an urban terrain
(MOUT) environment. The imagery includes interior and exterior
locations, where there were either zero, one, two, or three
individuals in the scenario. The same locations were collected four
times for the cases where 0-3 people are within the field of view.
Individuals who were in the field of view were typically obscured by
objects in the scene, such as doorways, windows, furniture, and
tables. For each of the scenarios, a horizontal pan of 150 images was
then used to create a larger mosaic of imagery in both the LWIR and
II bands.

The LWIR and II images were registered, boresighted and fused via 3
different algorithms. These fusion algorithms include: 1) Contrast
Pyramid A (CONA), 2) Contrast Pyramid B (CONB) [22] and 3) Discrete
Wavelet Transform (DWTT) [1,23,24]. The distinction between CONA and
CONB is which image (LWIR or II) populates the coarsest coefficients
in the pyramid. Furthermore, it is instructive to compare the fused
imagery against the source imagery. Therefore, we consider five image
displays: 1) CONA, 2) CONB, 3) DWTT, 4) II, and 5) LWIR. Fig. 2 shows
the resulting five image displays for one of the scenarios. In this
scenario there are two target persons, which are highlighted by the
blue boxes in each image.

A perception test was set up whereby observers were asked to try to
find these target persons in a "field of regard" search. An
observer's display was calibrated to look as though it were seeing a
single field of view of a given scene, and the observer had to
navigate across the scene and detect human targets. Observers could
mark as many as three places on the display as detections of human
targets (as they were told that the images could contain between zero
and three humans hiding in the scene). At any point an observer could
push a button to indicate that they either did not detect any targets
in the scene or that there were no other targets in the scene. Even
though the observers were not told to detect the targets as quickly
as possible, the time it took them to find targets and finish the
scene was recorded. In the end, the detection performance of the
observers was recorded over the five image displays (three fused
images and two source images).


Figure 1: ROC curves for the diffuse likelihood ratio test, monotonic
correlation test and Pearson correlation test: (a) $\alpha = 1$,
(b) $\alpha = 2$, and (c) $\alpha = 6$.
 m    $\Lambda^{1/35}$    p-value
 1    1.0722              0.135023
 2    0.0204              0.428441
 3    0.0301              0.398212
 4    0.0340              0.388389
 5    0.5925              0.174093
 6    0.3637              0.207319
 7    0.0376              0.380294
 8    0.0392              0.376840
 9    0.0382              0.378940
10    0.0422              0.370878
11    0.0316              0.394350
12    0.0362              0.383343
13    0.0479              0.360684
14    0.0252              0.412048
15    0.0242              0.415028
16    0.0387              0.377822

Table 1: List of geometric means and p-values of DPMLR for all 16
FIQMs, where m is the feature number.


Figure 2: Example of one of the 22 scenario images: (a) Contrast
pyramid A, (b) contrast pyramid B, (c) DWT, (d) II, and (e) LWIR.


As seen in Fig. 2(e), the human targets stand out in the LWIR imagery
because they are usually hotter than the background. For the most
part, detection performance is best on the LWIR-only band because the
search task can often be reduced to simply finding the white hot
object on a grey background. However, the II band has the potential
to add context to the LWIR band, as objects such as tables and chairs
are easier to distinguish in the II band (see Figs. 2(d) and (e)).
Therefore, there can be value in fusing the two bands.

Overall, $o = 8$ observers evaluated 22 scenarios that contained 35
human targets. We treat each actual target location as a scene, where
the scene is an image chip for one of the 22 scenarios. For example,
the regions inside the blue boxes in Fig. 2 represent two scenes.
Then, $y_s$ is the number of observers that correctly detected the
target located in the $s$-th scene for $s = 1, \ldots, 35$. We then
computed 16 potential FIQMs over each fused image. These FIQMs are
listed in Table 2 with corresponding citations. The first 10 measures
are simply complexity features that do not consider the source images
(the third class according to the classification in Section 1). The
last 6 measures compare how well the salient features in the two
source images are transferred into the fused image (the second
class). For the most part, the distinction between these comparative
measures is in the definition of saliency.

All but the contrast feature listed in Table 2 were also evaluated in
[17] for a recognition task. Furthermore, the contrast feature is the
only FIQM that is not fully automated. It is very similar to the
Fechner-Weber contrast measure used in [18]. To compute the contrast,
the human silhouettes were manually segmented for each scene. We
considered this measure because it is one of the features that is
averaged in an automated National Imagery Interpretability Ratings
Scale (NIIRS) rating [26]. Furthermore, it is intuitive that contrast
between the target and the background facilitates ease of detection.

Table 1 provides the DPMLR score over the 35 scenes for each of the
16 measures as well as the corresponding p-values. More precisely,
the table provides the geometric mean of the ascending or descending
diffuse likelihood ratios. The geometric mean provides a convenient
way to normalize the score against the number of scenes. The DPMLR
scores for all but the contrast measure are significantly less than
one. This indicates that these potential FIQMs behave as noise with
respect to ordering the detection probabilities of the imagery. For
the contrast measure, the geometric mean DPMLR score is still modest
at 1.0722 and the p-value is not very low. In fact, an ideal FIQM
that consistently ordered the number of detections $y$ over all 35
scenes would provide a DPMLR with a geometric mean of 9.632. This
means that while there is evidence to reject the null hypothesis, the
evidence to support the monotonic hypothesis is not compelling.
However, the DPMLR score for the contrast measure is much greater
than the scores for the others. Thus, the contrast feature may be a
key aspect of a proper FIQM.

Category                       Number  Feature Description
Contrast                       1       Contrast computed from I_t and I_b, where I_t is
                                       the average intensity of the target and I_b is
                                       the average intensity of the background
Saturation [17]                2       Normalized histogram peak
STD                            3       Standard deviation
Schmieder Weathersby [19]      4       Block average local standard deviation
FBM [20]                       5       Hurst parameter for fBm model
TIR [21]                       6       Block average target interference ratio (contrast)
Energy [21]                    7       Block average energy of histogram
Entropy [21]                   8       Block average entropy of histogram
Homogeneity [21]               9       Block average pixel variation
Block Outlier [21]             10      Block average number of outliers
Universal Quality Index [25]   11      Average Structure SIMilarity (SSIM) index
                                       between fused and reference images
Information Measures [11]      12      Average mutual information between fused
                                       and reference images (bin size = 16)
Objective Measure [10]         13      Average objective edge information
                                       between fused and reference images
Salient Quality Index [12]     14      Weighted average salient quality index of edge
                                       intensities between fused and reference images
                               15      Weighted average salient quality index between
                                       fused and reference images
                               16      Average salient quality index between
                                       fused and reference images

Table 2: List of FIQMs tested in this paper.
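The normalization by the number of scenes mentioned above amounts to taking the $S$-th root of the product of per-scene likelihood ratios. A minimal sketch follows, computed in the log domain so that a product over many scenes does not underflow; the per-scene ratios used here are hypothetical and are not values from Table 1.

```python
import math

def geometric_mean_ratio(ratios):
    """Return (prod_s ratio_s)^(1/S) for positive per-scene likelihood ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-scene ratios for one FIQM over three scenes.
per_scene = [2.0, 0.5, 4.0]
gm = geometric_mean_ratio(per_scene)
print(gm)   # cube root of 2.0 * 0.5 * 4.0
```

A geometric mean above one indicates that, on average per scene, the evidence favors the monotonic hypothesis over the null hypothesis.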

5  Conclusions 

In this paper, we propose the DPMLR to quantify how well a FIQM
matches the human-derived probability of detection. The paper
discusses some interesting properties of the DPMLR, and simulation
results demonstrate the advantages of the DPMLR over other linear and
monotonic correlation methods. Unlike the monotonic correlation in
[17], the DPMLR seamlessly accounts for the spread of the human
observations and the number of fused images. It indicates to what
degree the ordering of the human observations by the FIQM is not by
random chance. Finally, the DPMLR was used to score a number of
potential FIQMs using real image data with a corresponding perception
study.

The DPMLR scores reveal that a proper FIQM for the detection task is
not yet available. The comparative measures may have scored poorly
because the salient features exploited by these measures may not have
captured the context in II imagery that humans exploit for detection.
On the other hand, the contrast measure does demonstrate some utility
based on its DPMLR score. Future work can focus on the search for a
more appropriate FIQM. Such a measure may incorporate aspects of the
contrast.

While the DPMLR has many interesting properties, it is based upon
some simplifying assumptions. For instance, it assumes that the
observers' probabilities of false alarm are calibrated. Furthermore,
the evaluation of the image quality over chips in the larger scenario
images ignores some contextual information. Future research should
focus on statistical scoring mechanisms that account for increasingly
realistic data models.